
In anomaly detection systems, we usually want to know whether an anomaly is happening right now so that we can send an alert.

To identify if the last data point is an anomaly, we start by calculating the mean and standard deviation for each status code in the past hour:

CREATE TABLE server_log_summary (
   status_code int,
   period timestamptz,
   entries int
);
-- Load the sample data from CSV (psql meta-command, written in lowercase)
\copy server_log_summary(status_code, period, entries) FROM '/data.csv' DELIMITER ',' CSV HEADER;

WITH stats AS (
   SELECT
       status_code,
       (MAX(ARRAY[EXTRACT('epoch' FROM period), entries]))[2] AS last_value,
       AVG(entries) AS mean_entries,
       STDDEV(entries) AS stddev_entries
   FROM
       server_log_summary
   WHERE
       period > '2020-08-01 17:00 UTC'::timestamptz
   GROUP BY
       status_code
)
SELECT * FROM stats;


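To get the last value alongside the mean and standard deviation in a single GROUP BY, we used a little array trick: PostgreSQL compares arrays element by element, so MAX over ARRAY[epoch, entries] returns the array for the latest period, and [2] picks its entries value.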
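The array trick can be seen in isolation in the minimal sketch below. The three sample rows are hypothetical and not taken from the course data; only the MAX(ARRAY[...])[2] expression comes from the query above.

```sql
-- Hypothetical sample rows, used only to illustrate the trick.
WITH sample (status_code, period, entries) AS (
    VALUES
        (400, '2020-08-01 17:58 UTC'::timestamptz, 1),
        (400, '2020-08-01 17:59 UTC'::timestamptz, 0),
        (400, '2020-08-01 18:00 UTC'::timestamptz, 24)
)
SELECT
    status_code,
    -- MAX compares the arrays by their first element (the epoch) first,
    -- so the array for the latest period wins; [2] reads its entries value.
    (MAX(ARRAY[EXTRACT('epoch' FROM period), entries]))[2] AS last_value
FROM
    sample
GROUP BY
    status_code;
```

Here last_value is 24, the entries value of the most recent period, even though this is also a plain aggregate that can sit next to AVG and STDDEV in the same GROUP BY.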

Next, we calculate the z-score for the last value for each status code:

WITH stats AS (
   SELECT
       status_code,
       (MAX(ARRAY[EXTRACT('epoch' FROM period), entries]))[2] AS last_value,
       AVG(entries) AS mean_entries,
       STDDEV(entries) AS stddev_entries
   FROM
       server_log_summary
   WHERE
       period > '2020-08-01 17:00 UTC'::timestamptz
   GROUP BY
       status_code
)
SELECT
   *,
   (last_value - mean_entries) / NULLIF(stddev_entries::float, 0) as zscore
FROM
   stats;


We calculated the z-score as the number of standard deviations between the last value and the mean. To avoid a “division by zero” error when the standard deviation is 0, NULLIF turns the denominator into NULL, which makes the z-score NULL as well.
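To see how the zero case behaves, here is a small sketch that applies the same expression to two hand-written rows. The numbers are hypothetical: one row has a standard deviation of 0, the other does not.

```sql
-- Hypothetical pre-computed statistics, used only to illustrate NULLIF.
WITH stats (status_code, last_value, mean_entries, stddev_entries) AS (
    VALUES
        (200, 10, 10.0, 0.0),   -- constant series: standard deviation is 0
        (404,  8,  4.0, 2.0)    -- last value is 2 standard deviations above the mean
)
SELECT
    status_code,
    -- NULLIF returns NULL when the standard deviation equals 0,
    -- and dividing by NULL yields NULL instead of raising an error.
    (last_value - mean_entries) / NULLIF(stddev_entries::float, 0) AS zscore
FROM
    stats;
```

The row for status code 200 gets a NULL z-score, while the row for 404 gets a z-score of 2.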

Looking at the z-scores we got, we can spot that status code 400 received a very high z-score of 6. In the past minute, we returned a 400 status code 24 times, which is significantly higher than the average of 0.73 in the past hour.
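Spotting the outlier by eye works for a handful of status codes, but an alert needs an explicit rule. The sketch below keeps only the status codes whose z-score exceeds a cutoff. The cutoff of 3 is an assumption chosen for illustration, not a value from this lesson, and the query assumes the server_log_summary table created earlier.

```sql
-- Assumes the server_log_summary table created earlier in this lesson.
WITH stats AS (
   SELECT
       status_code,
       (MAX(ARRAY[EXTRACT('epoch' FROM period), entries]))[2] AS last_value,
       AVG(entries) AS mean_entries,
       STDDEV(entries) AS stddev_entries
   FROM
       server_log_summary
   WHERE
       period > '2020-08-01 17:00 UTC'::timestamptz
   GROUP BY
       status_code
),
scored AS (
   SELECT
       *,
       (last_value - mean_entries) / NULLIF(stddev_entries::float, 0) AS zscore
   FROM
       stats
)
SELECT
   *
FROM
   scored
WHERE
   ABS(zscore) > 3   -- hypothetical cutoff; rows with a NULL z-score are filtered out
ORDER BY
   ABS(zscore) DESC;
```

Using ABS flags unusually low values as well as unusually high ones; drop it if only spikes matter.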

Let’s take a look at the raw data:
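A minimal sketch of one way to pull those rows, assuming the server_log_summary table defined earlier. The LIMIT of 10 is an arbitrary choice made here to show only the most recent minutes.

```sql
-- Most recent per-minute counts for status code 400 in the past hour.
SELECT
   period,
   entries
FROM
   server_log_summary
WHERE
   status_code = 400
   AND period > '2020-08-01 17:00 UTC'::timestamptz
ORDER BY
   period DESC
LIMIT 10;   -- arbitrary number of rows, chosen for illustration
```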


It does look like we are getting more errors than expected in the last couple of minutes.

Figure: 400 status code entries

What our naked eye missed in the chart and the raw data was found by the query and was classified as an anomaly. We are off to a great start!
